System and method for instruction-level parallelism in a programmable multiple network processor environment

ABSTRACT

A system and method process data elements with instruction-level parallelism. An instruction buffer holds a first instruction and a second instruction, the first instruction being associated with a first thread, and the second instruction being associated with a second thread. A dependency counter counts satisfaction of dependencies of instructions of the second thread on instructions of the first thread. An instruction control unit is coupled to the instruction buffer and the dependency counter; the instruction control unit increments and decrements the dependency counter according to dependency information included in instructions. An execution switch is coupled to the instruction control unit and the instruction buffer, and the execution switch routes instructions to instruction execution units.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention is related to patent applications “System And Method For Processing Overlapping Tasks In A Programmable Network Processor Environment” (Ser. No. 09/833,581) and “System and Method for Data Forwarding in a Programmable Multiple Network Processor Environment” (Ser. No. 09/833,578), both of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to digital computing. More specifically, the present invention relates to network processors for processing network data elements.

2. Discussion of the Related Art

Network switches and routers, or network switch elements, form the backbone of digital networks, such as the Internet. Network switch elements connect network segments by receiving network data from ingress network segments and transferring the network data to egress network segments. Because large telecommunications switching facilities and central offices aggregate network traffic from extensive networks and many network segments, they require high-speed and high-availability switches and routers.

Network switch elements select the egress network segment by processing the address or destination included in the network data according to network data processing program logic. Traditionally, network switch elements included Application Specific Integrated Circuits (ASICs) that provided the program logic. Because ASICs are “hard-coded” with program logic for handling network traffic, they provide the high speed necessary to process a large volume of network data. ASICs, however, make it difficult to upgrade or reconfigure a network switch element, and it is expensive to design and fabricate a new ASIC for each new type of network switch element.

In response to these drawbacks, manufacturers of network switch elements are turning to programmable network processors to enable network switch elements to process network data. Programmable network processors process network data according to program instructions, or software, stored in a memory. The software allows manufacturers and users to define the functionality of the network switch elements, functionality that can be altered and changed as needed. With programmable network processors, manufacturers and users can change the software to respond to new services quickly, without costly system upgrades, as well as implement new designs quickly.

To the extent that there is a drawback to the use of programmable network processors in network switch elements, that drawback relates to speed. Because programmable network processors process network data using software, they are usually slower than a comparable hard-coded ASIC. One of the major design challenges, therefore, is developing programmable network processors fast enough to process the large volume of network data at large telecommunications switching facilities.

One technique used to increase speed in traditional processor design is “instruction-level parallelism,” or processing multiple threads of instructions on a processing element in parallel. However, traditional instruction-level parallelism techniques are either highly complex, or would introduce unacceptable delays and timing problems into the processing of network data, which must be processed on a time-critical basis.

SUMMARY OF THE INVENTION

The present invention provides a system and method for processing information using instruction-level parallelism. In the system, an instruction buffer holds a first instruction and a second instruction, the first instruction being associated with a first thread, and the second instruction being associated with a second thread. In this system, one or more instructions from the second thread may be dependent on the execution of one or more instructions in the first thread. A dependency counter is used to record dependencies of instructions between the first thread and the second thread. An instruction control unit is coupled to the instruction buffer and the dependency counter; the instruction control unit increments and decrements the dependency counter on the basis of information in the instructions. An execution switch is coupled to the instruction control unit and the instruction buffer; the execution switch sends instructions to an execution unit.

In the method, a first instruction associated with a first thread is loaded on a processing element. The processing element determines that execution of a second instruction depends on the execution of the first instruction, where the second instruction is associated with a second thread. A dependency counter associated with the second thread is incremented if the processing element determines that execution of the second instruction depends on the execution of the first instruction.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

FIG. 1 illustrates a system block diagram of a data communications system.

FIG. 2 illustrates a system block diagram of a programmable network processor.

FIG. 3 illustrates a system block diagram of a multiprocessor core.

FIG. 4 illustrates a system block diagram of an exemplary processing element.

FIG. 5 is a diagram illustrating concurrent processing of three threads of instructions.

FIG. 6 illustrates concurrent processing of two threads of instructions.

FIG. 7 illustrates dependency counter groups.

FIG. 8 illustrates an exemplary instruction.

FIG. 9 illustrates an exemplary process for executing instructions.

FIG. 10 illustrates an exemplary process for executing instructions.

DETAILED DESCRIPTION

Exemplary embodiments of the invention are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without parting from the spirit and scope of the invention.

Programmable network processors offer a number of advantages including flexibility, low cost, maintenance ease, decreased time to market, and increased service life. It is difficult, however, to develop a programmable network processor capable of meeting the demand for ever-increasing speed. One technique for increasing the speed of a programmable network processor is instruction-level parallelism. In instruction-level parallelism, threads of parallel programs can execute concurrently on a single processing element. Instruction-level parallelism allows a processing element to continue processing instructions, even if one or more threads are waiting for long-latency operations to complete.

One problem with instruction-level parallelism is maintaining synchronization of dependent instructions between the threads running on a processing element. Often, an instruction in one thread is dependent on the execution of instructions in another thread. Examples of instruction dependency are control dependency (i.e., the execution of one instruction is conditioned on the execution of another) and data dependency (i.e., one instruction uses the results of the execution of another instruction). Unfortunately, conventional techniques for synchronizing the execution of instructions among multiple threads do not lend themselves to programmable network processor applications. Conventional techniques introduce significant delays to processing, delays that are unsuitable for processing time-critical network data elements.

The present invention is directed to a system and method for synchronizing the execution of multiple threads of instructions on a single processing element at high speed. An instruction in a first thread can include dependence indicators, such as a bit or bits, that indicate dependence of the instruction on the execution of a second thread. When a processing element encounters an instruction that includes dependence indicators that indicate dependence between threads, the processing element checks, decrements, or increments one or more dependency counters that record satisfaction of dependencies between instructions and threads. If a dependence indicator indicates that an instruction in a first thread is dependent upon the execution of a second thread, a dependency counter is checked. If the dependency counter is not above a threshold, the processing element suspends the execution of the first thread until the dependency counter is incremented by the second thread to above the threshold. This allows the processing element to maintain synchronized execution of dependent instructions between threads in a highly efficient manner. It should be recognized that the concepts described below are not restricted to processing network data elements but are extensible to a generic form of data processing. Prior to discussing the features of the present invention, a brief description of a data communications system is provided.

FIG. 1 illustrates a block diagram of a network data communications system, according to an embodiment of the present invention. Data communications system 100 can be, for example, of the type used by network service providers and telecommunication carriers to provide voice and data communications services to consumers. Data communications system 100 includes network 102, network line modules 104₁–104_N, and switch fabric 106. Note that a subscript “N” in the figures denotes a plurality of elements generally, and not a specific number or equality of number between different elements with a subscript “N.”

Network 102 is connected to network line modules 104₁–104_N which, in turn, are connected to switch fabric 106. Although data communications system 100 is shown as including physical connections between the various components, other configurations are possible, such as wireless connections. Connections between network 102, network line modules 104₁–104_N, and switch fabric 106 can be, for example, wireless data connections, data over copper, fiber optic connections (e.g., OC-48, OC-192, OC-768), or other data communications connections as would be apparent.

Network line modules 104₁–104_N send network data elements to, and receive network data elements from, network 102. Network line modules 104₁–104_N process the network data elements and communicate the processed network data elements with switch fabric 106. Network data elements are signals carrying information including communications information. Examples of network data elements are asynchronous transfer mode (“ATM”) cells, Frame Relay frames, Internet Protocol (“IP”) packets, etc., and portions (segments) of these. Processing includes the concepts of performing a calculation or manipulation involving a network data element. Processing can include, for example, determining the next hop or egress port to which the network data element should be routed, network management, such as traffic shaping or policing, network monitoring, etc. Network 102 is a network for communicating network data elements. Network 102 can be, for example, the Internet, a telecommunications data network, an intranet, an extranet, a voice over data communications network, etc., and combinations thereof.

For descriptive clarity, operation of data communications system 100 is described in terms of network line module 104₁. Network line module 104₁ includes network line module ingress port 108, network line module egress port 110, and programmable network processors 112₁–112₂. Note that the configuration of network line modules 104₁–104_N is shown for illustrative purposes only, and alternate configurations for network line modules 104₁–104_N are possible. Alternate configurations include, for example, single or additional programmable network processors per network line module, additional network line module ingress ports, multiple egress ports, additional connections to network 102, etc.

Network line module 104₁ receives network data elements from network 102 at network line module ingress port 108. Programmable network processor 112₁ receives network data elements from network line module ingress port 108. Programmable network processor 112₁ enables network line module 104₁ to process the received network data elements. Programmable network processor 112₁ provides the network data elements to switch fabric 106 after processing.

Switch fabric 106 includes switch fabric ingress ports 114₁–114_N and switch fabric egress ports 116₁–116_N. Switch fabric ingress ports 114₁–114_N receive data from network line modules 104₁–104_N and switch fabric egress ports 116₁–116_N provide data to network line modules 104₁–104_N. Switch fabric 106 outputs network data elements received from network processor 112₁ on the desired switch fabric egress port 116₁–116_N. Network line module 104₁ receives processed network data elements from switch fabric egress port 116₁, performs additional processing as required, and transmits the network data element to network 102 via network line module egress port 110. Note that network line module ingress port 108, network line module egress port 110, switch fabric ingress ports 114₁–114_N, and switch fabric egress ports 116₁–116_N are logical representations of physical devices, and other combinations, such as single ports that transmit and receive network data elements, are possible.

FIG. 2 illustrates a system block diagram of a programmable network processor, according to an embodiment of the present invention. Programmable network processor 200 can be considered an exemplary embodiment of both ingress and egress programmable network processors 112₁–112_N, as described above. Programmable network processor 200 includes memory controller 204, input interface 206, multiprocessor core 202, and output interface 208. Multiprocessor core 202 is connected to input interface 206, output interface 208, and memory controller 204. Note that the particular configuration, number, and type of elements of programmable network processor 200 are shown for illustrative purposes only and other configurations of programmable network processor 200 are possible as would be apparent.

For the purposes of this description, it is presumed that the programmable network processor 200 of FIG. 2 corresponds to programmable network processor 112₁. In operation, such a programmable network processor 200 receives network data elements from network line module ingress port 108 via input interface 206. Input interface 206 receives the network data elements and provides them to multiprocessor core 202 for processing as described above. Multiprocessor core 202 processes the network data elements and provides the result to output interface 208. Output interface 208 receives processed network data elements from multiprocessor core 202 and forwards them to switch fabric 106 for routing. Multiprocessor core 202 accesses storage located off programmable network processor 200 via memory controller 204.

Multiprocessor core 202 is connected to host control processor 210. Host control processor 210 provides network management logic and information for programmable network processor 200. Such network management logic and information includes, for example, generating and receiving network data elements for controlling switch fabric 106, network line modules 104₁–104_N, and other network components. Host control processor 210 performs other functions, such as generating network data elements for switch fabric control, setting up network connections, and loading programs into multiprocessor core 202 for operation.

FIG. 3 illustrates a system block diagram of a multiprocessor core, according to an embodiment of the present invention. Multiprocessor core 300 is an exemplary embodiment of multiprocessor core 202, as described above. Although multiprocessor core 300 can be used for a generic form of data processing, multiprocessor core 300 can also be of the type employed in data communications system 100. Multiprocessor core 300 includes processing elements (PE) 302₁–302_N, data memories (DM) 304₁–304_N, program memories (PM) 306₁–306_N, intraswitch 314, and host control interface 308. Processing elements 302₁–302_N are connected to program memories 306₁–306_N and intraswitch 314. Data memories 304₁–304_N are connected to intraswitch 314. Program memories 306₁–306_N are connected to processing elements 302₁–302_N and intraswitch 314. Host control interface 308 is connected to intraswitch 314. Intraswitch 314 is connected to on-chip peripheral units 310 and 312. Examples of on-chip peripheral units 310 and 312 are input interface 206, output interface 208, and memory controller 204 of FIG. 2.

Processing elements 302₁–302_N process network data elements, thereby providing the processing functionality for multiprocessor core 300. Processing elements 302₁–302_N execute program instructions from program memories 306₁–306_N, and load and store data in data memories 304₁–304_N. Each of processing elements 302₁–302_N can process multiple threads of instructions concurrently, according to an embodiment of the present invention.

Program memories 306₁–306_N and data memories 304₁–304_N provide data storage functionality for the various elements of multiprocessor core 300. Program memories 306₁–306_N store program instructions for the processing of network data elements by processing elements 302₁–302_N. Although FIG. 3 depicts groups of four processing elements directly connected to one of program memories 306₁–306_N, other configurations connecting program memory to processing elements are possible, including, for example, additional processing elements or program memories as would be apparent. Data memories 304₁–304_N provide on-chip storage for data, such as intermediate-results data from processing network data elements, for the operation of processing elements 302₁–302_N.

Intraswitch 314 enables communication between the various components of multiprocessor core 300. For example, processing elements 302₁–302_N access data memories 304₁–304_N through intraswitch 314. Intraswitch 314 can be, for example, a switching fabric in multiprocessor core 300, or individual trace connections in multiprocessor core 300. Host control interface 308 connects multiprocessor core 300 to host control processor 210. Multiprocessor core 300 is connected to on-chip peripheral units 310 and 312 via intraswitch 314.

In operation, multiprocessor core 300 receives network data elements from on-chip peripheral units 310 and 312. Processing elements 302₁–302_N receive the network data elements and process them according to the programs stored as instructions in program memories 306₁–306_N. The intermediate results and final results of the processing operations are stored in data memories 304₁–304_N. After a network data element has been processed, it is sent to on-chip peripheral units 310 and 312.

FIG. 4 illustrates a system block diagram of an exemplary processing element, according to an embodiment of the present invention. Processing element 400 is an example of one of the processing elements shown in FIG. 3, and can be employed in a generic form of data processing or can be of the type that is employed in data communications system 100.

Moreover, exemplary processing element 400 is an instruction-level parallel processing element, in which two or more threads of parallel programs execute concurrently. Processing element 400 can, therefore, maintain a high utilization under conditions where the processing element would otherwise idle waiting for long-latency operations to complete. Note that processing element 400 is provided for illustrative purposes only and that other processing element configurations are possible.

Processing element 400 includes instruction fetch unit 402 and instruction buffers 404A, 404B, 404C, and 404D. Processing element 400 also includes function decode and execution switch 406, dependency counters 410, instruction issue control 408, memory/peripheral interface unit 412, primary function unit 414, auxiliary function unit 416, and register file 418. Note that although dependency counters 410 are shown as being part of instruction issue control 408, other configurations are possible. For example, dependency counters 410 can also be connected to, but not part of, instruction issue control 408.

Instruction fetch unit 402 is connected to each of instruction buffers 404A–404D. Each of the connections between fetch unit 402 and instruction buffers 404A–404D provides a path for instructions from a program thread. Instruction buffers 404A–404D are, in turn, connected to function decode and execution switch 406. Instruction buffers 404A–404D are also connected to instruction issue control 408. Instruction issue control 408 is connected to function decode and execution switch 406. Function decode and execution switch 406 is connected to memory/peripheral interface unit 412, primary function unit 414, and auxiliary function unit 416. Memory/peripheral interface unit 412, primary function unit 414, and auxiliary function unit 416 are also referred to herein as execution units 412–416. Memory/peripheral interface unit 412 is connected to intraswitch 314 and register file 418. Primary function unit 414 is connected to register file 418. Auxiliary function unit 416 is connected to register file 418.

Register file 418 includes read ports 420 and write port 422. Read ports 420 allow execution units 412–416 to read data from the various registers in register file 418. Write port 422 allows execution units 412–416 to write data to register file 418.

Exemplary processing element 400 is shown as supporting four concurrent threads of instructions. Instruction fetch unit 402 fetches instructions from program memory 306. The instructions are entered in the four instruction buffers 404A–404D according to the program thread they belong to. Each of instruction buffers 404A–404D is associated with one of four threads. For descriptive clarity, the convention of associating thread 0 (T0) with instruction buffer 404A, thread 1 (T1) with instruction buffer 404B, thread 2 (T2) with instruction buffer 404C, and thread 3 (T3) with instruction buffer 404D is adopted.

Function decode and execution switch 406 receives the instructions associated with the four threads from instruction buffers 404A–404D. Function decode and execution switch 406 provides the instructions to execution units 412–416.

FIG. 5 is a diagram illustrating concurrent processing of three threads of instructions. Instruction processing diagram 500 illustrates the problem of instruction synchronization between multiple threads. The instructions of one thread can be dependent on the results of instructions in another thread. For example, the contents of a register that is set by a first instruction in one thread can be used by a second instruction in another thread. In such a case, if the first instruction is not executed before the second instruction, the register will not include data valid for the execution of the second instruction. These types of problems are referred to as synchronization problems, and may result in a program execution error.

Instruction processing diagram 500 shows three threads of instructions, thread 502, thread 504, and thread 506. Threads 502–506 can be of the type employed in a generic form of data processing or can be of the type that are employed in data communications system 100. Note that three threads are shown for descriptive clarity only, and other configurations are possible. A processing element can process as few as two threads, and as many threads as are accommodated by the processing element architecture. For example, processing element 400 accommodates four concurrent threads of instructions.

Each of threads 502–506 is shown including two instructions. Thread 502 includes instruction 508 (i1) and instruction 510 (i2). Thread 504 includes instruction 512 (i3) and instruction 514 (i4). Thread 506 includes instruction 516 (i5) and instruction 518 (i6). Note that instruction processing diagram 500 shows two instructions per thread for descriptive clarity only, and other configurations are possible. For example, each of threads 502–506 can include additional instructions (not shown) before the first instruction (e.g., instruction 508 in thread 502), between the first and second instruction (e.g., instructions 508 and 510 in thread 502), and after the second instruction (e.g., instruction 510 in thread 502). Threads 502–506 can include as many instructions as are required to perform generic data processing or perform processing for data communications system 100.

Generally, a processing element processes the three threads by executing their respective instructions. Instruction processing diagram 500 shows instruction execution proceeding from left to right, and the relative spacing of instructions indicates when an instruction is being executed. For example, instruction processing diagram 500 shows instruction 508 is executed before instruction 510 of thread 502. Note also the chronological relationships between instructions of different threads. For example, the processing element executes instruction 508 of thread 502 before instruction 512 of thread 504, and instruction 512 before instruction 516 of thread 506.

Additionally, instruction processing diagram 500 shows the dependency between the instructions of threads 502–506. Dependency exists when the execution of a second instruction is conditional on the execution of a first instruction. Consider, for example, a situation in which a first instruction in a first thread writes a value to a register file, such as register file 418, and a second instruction in a second thread subsequently reads the value from the register file and uses the value as an operand in a calculation. In this situation, the first instruction is referred to as the dependee instruction, and the second instruction is referred to as the dependent instruction. A dependent instruction is an instruction that must not be executed before the instruction on which it depends. A dependee instruction is an instruction on which a dependent instruction depends. As long as the dependee instruction is executed before the dependent instruction, the register file includes the correct value for the execution of the dependent instruction.

Depends indicators 520–526 are used to show dependencies between the instructions of threads 502–506. Depends indicators are drawn from a dependent instruction to a dependee instruction (i.e., the arrow of the depends indicator points to the dependee instruction). Depends indicator 520 indicates that the execution of instruction 512 depends on the execution of instruction 508. Depends indicator 522 indicates that the execution of instruction 510 depends on the execution of instruction 514. Depends indicator 524 indicates that the execution of instruction 516 depends on the execution of instruction 510. Depends indicator 526 indicates that the execution of instruction 518 depends on the execution of instruction 514.

As described above, if a first instruction depends on a second, earlier executed, instruction, processing may proceed normally. Instruction processing diagram 500 shows instruction 512 and instruction 516 dependent on earlier executed instructions. Program errors may occur, however, if a first instruction depends on a later executed instruction. Instruction processing diagram 500 shows the synchronization problem as instruction 510 depending on a later executed instruction. As such, it is important for a processing element to synchronize the execution order of dependent and dependee instructions between threads to avoid such program errors.
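The ordering condition described above can be illustrated with a short Python sketch. The dependencies are those named for depends indicators 520–526; the numeric issue times are hypothetical values chosen only to be consistent with the ordering stated for instruction processing diagram 500, and the function name is likewise an assumption for illustration.

```python
# Hypothetical issue times (assumed for illustration; FIG. 5 gives only
# relative ordering). Keys are instruction reference numbers.
exec_time = {508: 0, 512: 1, 510: 2, 516: 3, 514: 4, 518: 5}

# (dependent, dependee) pairs for depends indicators 520, 522, 524, 526.
depends_on = [(512, 508), (510, 514), (516, 510), (518, 514)]


def violated_dependencies(exec_time, depends_on):
    """Return the pairs whose dependee is not executed before the dependent."""
    return [
        (dependent, dependee)
        for dependent, dependee in depends_on
        if exec_time[dependee] >= exec_time[dependent]
    ]


print(violated_dependencies(exec_time, depends_on))
```

Under these assumed times, the only pair reported is instruction 510 depending on instruction 514, which is the synchronization problem identified in instruction processing diagram 500.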

The present invention provides a system and method that maintains the order of instruction execution between threads. Generally, a processing element processes multiple threads of instructions. Instructions in the threads can include dependence indicators that indicate dependencies between instructions and threads. When the processing element encounters instructions that include dependence indicators identifying a dependent instruction or thread, it checks, decrements, or increments one or more dependency counters. If the dependency counter is not above a threshold, it indicates that a dependency has not been satisfied, and the processing element can suspend the execution of a thread until the dependency counter is incremented to above the threshold. This allows the processing element to maintain a form of synchronized execution of dependent instructions between threads.

In one embodiment, instructions can include the dependence indicators as bits, called “depends” bits and “tells” bits. A depends bit is an indicator in a dependent instruction that a particular other thread includes an instruction on which this one depends. A tells bit is an indicator in a dependee instruction that a particular other thread includes an instruction dependent on this one. The additional bits can be included with the instruction in a number of ways. For example, a compiler for instruction-level parallel processors can include the bits at compile time based on dependencies, or a programmer may specify the instruction execution order by including “depends” and “tells” bits when coding, etc.

An exemplary embodiment is described herein to provide context for discussion, and the present invention encompasses other embodiments, as are described further below. Consider an exemplary processing element processing four threads of instructions. Each of the instructions in the four threads can include depends bits and tells bits. In an exemplary embodiment, each instruction in a thread can include three depends bits, each of which indicates that the instruction is dependent on one of the other three threads. Similarly, each instruction in a thread can include three tells bits, each of which indicates that one of the other three threads depends on the execution of the instruction.
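As a minimal sketch of how three depends bits (or three tells bits) could identify the other three threads, consider the following Python fragment. The bit layout (lowest bit for the lowest-numbered other thread) and the helper names are assumptions made for illustration; the description above does not fix a particular bit assignment.

```python
NUM_THREADS = 4  # four-thread exemplary embodiment


def other_threads(own_thread):
    """The three threads an instruction's depends/tells bits can refer to."""
    return [t for t in range(NUM_THREADS) if t != own_thread]


def encode_bits(own_thread, target_threads):
    """Pack a set of other-thread IDs into a 3-bit field (assumed layout)."""
    others = other_threads(own_thread)
    bits = 0
    for target in target_threads:
        # others.index raises ValueError if a thread names itself.
        bits |= 1 << others.index(target)
    return bits


def decode_bits(own_thread, bits):
    """Recover the other-thread IDs selected by a 3-bit field."""
    return [t for i, t in enumerate(other_threads(own_thread)) if bits & (1 << i)]


# An instruction in thread 1 that depends on threads 0 and 3.
depends_field = encode_bits(1, [0, 3])
print(format(depends_field, "03b"), decode_bits(1, depends_field))
```

Because a thread never depends on or tells itself, three bits per field suffice for four threads, which is why the field is indexed by position among the other threads rather than by absolute thread number in this sketch.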

In the exemplary embodiment, the processing element can include four groups of dependency counters, each of which is associated with one of the four threads. Each of the groups of dependency counters includes three individual dependency counters, each of which is associated with one of the other three threads. For instance, consider four exemplary threads, thread 0, thread 1, thread 2, and thread 3, each having an associated group of dependency counters. The exemplary group of dependency counters associated with thread 0 includes three individual dependency counters, each of which is associated with one of thread 1, thread 2, or thread 3.

In operation, the exemplary processing element processes the instructions of the four threads. When the exemplary processing element encounters an instruction in a first thread that includes a tells bit identifying a second thread (i.e., one of the other three threads), the exemplary processing element increments the dependency counter associated with the first thread of the group of dependency counters associated with the second thread.
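The counter organization and the tells-bit update described above can be sketched in Python as follows. The nested-dictionary representation and the function names are assumptions for illustration only; in the exemplary embodiment the counters are hardware elements such as dependency counters 410.

```python
NUM_THREADS = 4


def make_counter_groups():
    """Four groups (one per thread), each holding one counter per other thread.

    counters[owner][source] records tells from `source` destined for `owner`.
    """
    return {
        owner: {source: 0 for source in range(NUM_THREADS) if source != owner}
        for owner in range(NUM_THREADS)
    }


def apply_tells(counters, executing_thread, told_threads):
    """Increment, in each told thread's group, the counter for the executing thread."""
    for told in told_threads:
        counters[told][executing_thread] += 1


counters = make_counter_groups()
# An instruction in thread 0 carries tells bits identifying threads 1 and 3.
apply_tells(counters, executing_thread=0, told_threads=[1, 3])
print(counters[1], counters[3])
```

Note the direction of the update: the counter that changes belongs to the group of the thread being told, and within that group it is the counter associated with the thread that executed the instruction.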

When the exemplary processing element processes an instruction in a first thread that includes a depends bit identifying a second thread, the processing element checks the dependency counter associated with the second thread of the group of dependency counters associated with the first thread to determine whether the instruction can be executed. If the value of the exemplary dependency counter is above a threshold (e.g., non-zero when the threshold is zero), the processing element executes the instruction. If, on the other hand, the value of the exemplary dependency counter is not above the threshold, processing of the first thread is inhibited. The processing element increments the dependency counter when instructions including tells bits in the second thread are executed, and processing of the first thread is resumed once the dependency counter is above the threshold. Note that an instruction can include multiple dependency indicators, such as one or more tells bits in combination with one or more depends bits. When an instruction includes more than one depends bit, all of the associated dependency counters must be above the threshold before the instruction is executed.
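
The tells and depends handling just described can be sketched as follows. This is a minimal model under stated assumptions: the threshold is taken to be zero, and the names `process_tells` and `may_execute` are hypothetical helpers that do not appear in the embodiment.

```python
THRESHOLD = 0  # assumption: "above the threshold" means non-zero

# counters[t][u]: counter in thread t's group associated with thread u
counters = {t: {u: 0 for u in range(4) if u != t} for t in range(4)}


def process_tells(counters, first, told_threads):
    """An instruction in thread `first` carries tells bits naming `told_threads`."""
    for second in told_threads:
        counters[second][first] += 1


def may_execute(counters, first, depends_threads):
    """True only if every counter named by a depends bit is above the threshold."""
    return all(counters[first][second] > THRESHOLD for second in depends_threads)


# An instruction in thread 1 depends on threads 0 and 2.
process_tells(counters, first=0, told_threads=[1])   # only thread 0 has told so far
blocked = not may_execute(counters, first=1, depends_threads=[0, 2])
process_tells(counters, first=2, told_threads=[1])   # now thread 2 tells as well
ready = may_execute(counters, first=1, depends_threads=[0, 2])
```

In this sketch the thread 1 instruction stays blocked after the first tells bit and becomes ready only after the second, which mirrors the requirement that every associated counter be above the threshold.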

The threshold is a dependency counter value chosen to ensure that dependent instructions are not executed before the instructions in other threads on which they depend. The threshold value can be set to ensure correct instruction-level synchronization. For example, the threshold can be chosen to be zero, so that a dependency counter must be incremented before a dependent instruction can be executed, as is described in further detail below. Network data element processing is often repetitive and predictable. As such, a programmer, or compiler, can determine the value at which the threshold can be set. Note that although one embodiment of the present invention is explained in terms of a “threshold,” “above a threshold,” and “not above a threshold,” other configurations that record dependency between instructions and threads are possible. For example, in an alternate embodiment, the processing element can suspend processing a thread if a dependency counter falls below a threshold.

According to an embodiment of the present invention, depends bits, tellsbits, and dependency counters are used to record the satisfaction ofdependencies between instructions in a first thread and the processingof a second thread. This is in contrast to instruction processingdiagram 500 of FIG. 5 that shows dependencies between individualinstructions. It is sufficient to record dependency at this levelbecause the present invention provides a system and method that ensuresthat dependent instructions are executed after the instructions on whichthey depend.

Consider, for example, the application of “depends” bits and “tells”bits to instruction processing diagram 500 of FIG. 5. In this example,instruction 512 would include a depends bit identifying instruction 512as dependent upon instructions in thread 502. In one embodiment, thedepends bit identifies the thread that includes the instruction on whichinstruction 512 is dependent, which is, in this case, thread 502. Inanother embodiment, the depends bits can identify the type or particularone of the instructions in thread 502. For example, the instruction caninclude more bits (i.e., more information) that identify instructioncharacteristics (such as type, priority, etc.). For descriptive clarity,however, depends bits and tells bits are described herein as identifyingthreads, and not instructions. As such, instruction 508 would include atells bit that identifies thread 504 as including an instruction orinstructions that are dependent upon the execution of instruction 508.

Similarly, instruction 510 would include a tells bit identifying thread 506 as including instructions dependent upon the execution of instruction 510. Instruction 510 would also include a depends bit identifying instruction 510 as dependent on the execution of instructions in thread 504. Instruction 514 would include a tells bit identifying thread 502 as including instructions that are dependent on the execution of instruction 514. Instruction 514 also would include a tells bit identifying thread 506 as including instructions dependent on the execution of instruction 514. Instruction 516 would include a depends bit identifying instruction 516 as dependent on instructions in thread 502. Instruction 518 would include a depends bit identifying instruction 518 as dependent on the execution of instructions in thread 504.

FIG. 8 illustrates an exemplary instruction, according to an embodiment of the present invention. Instruction 800 includes opcode 802, source 0 804, source 1 806, result 808, depends bit 810, depends bit 812, depends bit 814, tells bit 816, tells bit 818, and tells bit 820. Opcode 802 is the operator for instruction 800. Source 0 804 specifies a first operand operated upon by opcode 802. Source 1 806 specifies a second operand operated upon by opcode 802. Result 808 identifies a register to which the results of opcode 802 are stored.

Depends bits 810–814 indicate that instruction 800 depends upon the execution of instructions in other threads. Instruction 800 is configured for a processing element that supports the operation of four threads. Note that although instruction 800 includes three depends bits, which identify three other threads, and three tells bits, which also identify three other threads, other configurations are possible. By adding additional bits or changing how the bits are used, instruction 800 can be configured for a processing element that supports more than four threads. Consider, for example, binary coding of depends bits 810–814 and tells bits 816–820. In such an example, depends bits 810–814 can represent up to eight other threads, extending instruction 800 to a processing element supporting nine threads. Similarly, additional depends and tells bits can be added as is necessary for a given processing element architecture.

Consider, for example, the case in which instruction 800 is executing in thread 1. If instruction 800 is executing in thread 1, the other three threads on which the execution of instruction 800 may depend are thread 0, thread 2, and thread 3. In this case, depends bit 810 can identify instruction 800 as dependent on thread 0, depends bit 812 can identify instruction 800 as dependent on thread 2, and depends bit 814 can identify instruction 800 as dependent on thread 3. Likewise, tells bit 816 can identify thread 0 as dependent on instruction 800. Tells bit 818 can identify thread 2 as dependent on instruction 800. Tells bit 820 can identify thread 3 as dependent on instruction 800.
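
The mapping from bit positions to thread numbers in the thread 1 example can be sketched as follows. The `Instruction` class, its field names, and the helpers `other_threads` and `decode` are hypothetical; the ascending ordering of the other threads follows the thread 1 example above and is only an assumption for instructions executing in other threads.

```python
from dataclasses import dataclass
from typing import Tuple

NUM_THREADS = 4  # assumption: four-thread processing element


@dataclass
class Instruction:
    """Hypothetical model of instruction 800; field names are assumptions."""
    opcode: str
    source0: int
    source1: int
    result: int
    depends: Tuple[bool, bool, bool] = (False, False, False)  # bits 810, 812, 814
    tells: Tuple[bool, bool, bool] = (False, False, False)    # bits 816, 818, 820


def other_threads(own_thread, num_threads=NUM_THREADS):
    """Threads that the three bit positions refer to, in ascending order."""
    return [t for t in range(num_threads) if t != own_thread]


def decode(bits, own_thread):
    """Translate set bit positions into the thread numbers they identify."""
    return [t for bit, t in zip(bits, other_threads(own_thread)) if bit]


insn = Instruction("add", 1, 2, 3,
                   depends=(True, False, True), tells=(False, True, False))
depends_on = decode(insn.depends, own_thread=1)  # bits 810 and 814 set
tells_to = decode(insn.tells, own_thread=1)      # bit 818 set
```

For an instruction in thread 1, the set depends bits 810 and 814 decode to threads 0 and 3, and the set tells bit 818 decodes to thread 2.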

As suggested by the relationships described above, dependency countergroups are a set of dependency counters associated with each thread.Each of threads 502–506 of instruction processing diagram 500, forexample, would have, or be associated with, a dependency counter group.Each dependency counter group could include a number of individualdependency counters, each of which is associated with one of the otherthreads executing on the processing element. For example, the dependencycounter group associated with thread 502 of instruction processingdiagram 500 would include two dependency counters, one related to, orassociated with, thread 504, and one related to, or associated with,thread 506.

FIG. 7 illustrates exemplary dependency counter groups, according to an embodiment of the present invention. FIG. 7 shows four dependency counter groups, each of which is associated with one of four threads. Dependency counter group 702 (T0) is associated with thread 0, dependency counter group 704 (T1) is associated with thread 1, dependency counter group 706 (T2) is associated with thread 2, and dependency counter group 708 (T3) is associated with thread 3. Each of dependency counter groups 702–708 includes three dependency counters, each of which is associated with one of the other three threads. Dependency counter group 702 includes dependency counter T0 ₁, dependency counter T0 ₂, and dependency counter T0 ₃. Dependency counter T0 ₁ is that dependency counter of thread 0 that is related to, or associated with, thread 1. Similarly, dependency counter T0 ₂ and dependency counter T0 ₃ are thread 0 dependency counters associated with, or related to, threads 2 and 3, respectively. In the same manner, dependency counter group 704 includes dependency counter T1 ₀, dependency counter T1 ₂, and dependency counter T1 ₃. Dependency counter T1 ₀ is associated with thread 0, dependency counter T1 ₂ is associated with thread 2, and dependency counter T1 ₃ is associated with thread 3. Also, dependency counter group 706 includes dependency counter T2 ₀, dependency counter T2 ₁, and dependency counter T2 ₃. Dependency counter T2 ₀ is associated with thread 0, dependency counter T2 ₁ is associated with thread 1, and dependency counter T2 ₃ is associated with thread 3. Dependency counter group 708 includes dependency counter T3 ₀, dependency counter T3 ₁, and dependency counter T3 ₂. Dependency counter T3 ₀ is associated with thread 0, dependency counter T3 ₁ is associated with thread 1, and dependency counter T3 ₂ is associated with thread 2.

Note that although four dependency counter groups are shown (as areimplemented in one embodiment to support four threads), and thedependency counter groups include three dependency counters each, otherconfigurations are possible. For example, greater or fewer than fourdependency counter groups can be used according to the number of threadsa processing element can execute concurrently. Additionally, dependencycounter groups 702–708 can include more or fewer dependency counters,depending on the processing element architecture.

Moreover, although the invention and illustrative examples are described in terms of dependency counter groups and dependency counters, other configurations are possible. Consider, for example, bi-state or tri-state elements substituted for the dependency counters of dependency counter groups 702–708. A bi-state element associated with a first thread can be set when a corresponding dependee instruction in a second thread is executed, and reset when the dependent instruction is executed. In this example, a processing element suspends processing the first thread when it encounters an instruction including a depends bit if the bi-state element is not set. Similarly, tri-state elements and other state-retaining elements can be set and reset by the processing element. In this embodiment, however, care should be taken to avoid overflowing the state elements. For example, a bi-state element may be incremented, or changed, only once in response to an instruction that includes a tells bit.

Similarly, the implementation of the present invention should accountfor the size of the dependency counters to avoid overflow. Consider, forexample, the case in which multiple instructions including tells bitsidentifying one thread are executed. In such a case, it is possible tooverflow the dependency counter. Dependency counters, therefore, shouldbe specified large enough to ensure that overflow will never occur, orlimits should be set on the number of times a dependency counter can beincremented. For example, a first thread that includes many instructionsthat include tells bits identifying a second thread can be suspendedonce the dependency counter associated with the second thread hasreached a limit. The limit can ensure that the dependency counter doesnot overflow, and can also ensure that a dependee thread does not gettoo far ahead of a dependent thread.
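
One way to picture the limit described above is the following sketch. The limit value, the name `COUNTER_LIMIT`, and the helper `try_increment` are assumptions chosen for illustration; the embodiment does not specify a particular limit or interface.

```python
COUNTER_LIMIT = 3  # assumption: a small limit chosen for illustration


def try_increment(counters, teller, told, limit=COUNTER_LIMIT):
    """Increment the told thread's counter for `teller` unless it is at the limit.

    Returns False when the limit has been reached, in which case the
    telling thread would be suspended rather than run further ahead.
    """
    if counters[told][teller] >= limit:
        return False
    counters[told][teller] += 1
    return True


counters = {0: {1: 0}, 1: {0: 0}}
# Thread 0 executes five instructions in a row that tell thread 1.
results = [try_increment(counters, teller=0, told=1) for _ in range(5)]
```

In this sketch the first three increments succeed and the remaining two are refused, so the counter never exceeds the limit and the dependee thread cannot get arbitrarily far ahead of the dependent thread.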

In operation, a tells bit affects one or more dependency counters of the threads other than the one on which the tells bit appears. By contrast, a depends bit affects one or more dependency counters of the thread on which the depends bit appears. Thus, when the processing element detects a first instruction in a first thread as including a tells bit that identifies a second thread, the processing element increments one of the dependency counters in the dependency counter group of the second thread. In particular, it increments that dependency counter of the second thread that is associated with the first thread. Consider, for example, the case in which thread 1 is executing a stream of instructions. One of the instructions in thread 1 includes a tells bit that identifies thread 0. In response to the tells bit, the processing element increments the particular dependency counter in dependency counter group 702 associated with thread 1. In the example of dependency counter group 702, dependency counter T0 ₁ is associated with thread 1. The processing element, therefore, increments T0 ₁ of dependency counter group 702 when the thread 1 instruction tells bit is detected. Similarly, when the processing element detects an instruction in a thread that includes a depends bit, the dependency counters are checked, and the processing element either suspends the dependent thread or executes the instruction and decrements the associated dependency counter.

For example, thread 1 can include an instruction that includes a depends bit that identifies the instruction as depending on the execution of thread 0. In this case, when the processing element detects the depends bit, the dependency counter associated with thread 0 of the dependency counter group associated with thread 1 is checked. In this case, dependency counter T1 ₀ of dependency counter group 704 is associated with thread 0. Depending on the value of dependency counter T1 ₀, the processing element either suspends processing thread 1 or both decrements T1 ₀ and continues processing thread 1, thereby executing the instruction. Once suspended, the processing element resumes processing thread 1 when dependency counter T1 ₀ is incremented by the processing element (i.e., when an instruction in thread 0 with a tells bit is executed).

FIG. 6 illustrates concurrent processing of two threads of instructions,according to an embodiment of the present invention. Threadsynchronization diagram 600 shows thread 602 and thread 604 as a seriesof processing steps. A processing step is an action or actions performedby a processing element in the implementation of one embodiment of thepresent invention. A processing step can be, for example, the executionof an instruction, incrementing a dependency counter, decrementing adependency counter, etc. Thread 602 includes processing step 606,processing step 608, processing step 610, processing step 612,processing step 614, and processing step 616. Thread 604 includesprocessing step 618, processing step 620, and processing step 622.Although synchronization diagram 600 only shows two threads ofinstructions, other configurations are possible. For example, the systemand method of the present invention can be extended to three, four, andmore than four threads, as described above.

For the purpose of descriptive clarity, the instructions of threadsynchronization diagram 600 are referred to as instruction 508 (i1),instruction 510 (i2), instruction 512 (i3), and instruction 514 (i4).Note, however, that instruction processing diagram 500 shows instruction512 as dependent on instruction 508, and shows instruction 510 asdependent on instruction 514. Thread synchronization diagram 600, on theother hand, shows the instructions of thread 602 dependent on theexecution of instructions in thread 604 generally, and the instructionsof thread 604 dependent on the execution of instructions in thread 602generally. The dependencies between instructions 508–514 shown ininstruction processing diagram 500 are implemented in the operation ofone embodiment of the present invention through the general dependencyof instructions within one thread on the processing of another thread(i.e., rather than particular instructions). This concept is illustratedin further detail below.

Additionally, thread synchronization diagram 600 shows tells bits 624 and 630 and depends bits 626 and 628 as arrows pointing from processing steps to the threads that the bits identify. The arrows are shown to indicate that an instruction being processed in a processing step includes a tells bit or depends bit, and identifies the thread to which the bit points. Either the thread pointed to depends on the instruction (i.e., tells bit), or the instruction depends on the thread (i.e., depends bit). For example, tells bit 624 identifies thread 604 as dependent on instruction 508 of processing step 606. Similarly, depends bit 626 identifies instruction 512 of processing step 618 as dependent on thread 602.

Processing of thread 602 and thread 604 begins when the processing element executes instruction 508, in processing step 606. Instruction 508 includes tells bit 624 that identifies thread 604 as dependent on instruction 508. The processing element detects tells bit 624 and increments a dependency counter in dependency counter group 704, which is associated with thread 604 (T1).

As described above, a dependency counter group is associated with a thread, and the dependency counter group includes dependency counters, each of which is associated with one of the other threads executing on the processing element. Thread synchronization diagram 600 is described in terms of dependency counter group 702 (associated with thread 602) and dependency counter group 704 (associated with thread 604). Dependency counter T1 ₀ of dependency counter group 704 is associated with thread 602, and dependency counter T0 ₁ of dependency counter group 702 is associated with thread 604.

After processing step 606, the processing element receives instruction512, in processing step 618. Instruction 512 includes depends bit 626identifying instruction 512 as dependent on the execution ofinstructions in thread 602. The processing element determines ifdependency counter T1 ₀ is above a predefined threshold. For thepurposes of explanation, dependency counter T1 ₀ is assumed to have beenabove, or at the threshold, so that it is above the threshold afterbeing incremented. Since the processing element has incrementeddependency counter T1 ₀, when the dependency counter is checked inresponse to instruction 512, the processing element determines thatdependency counter T1 ₀ is above the threshold.

Since dependency counter T1 ₀ is above the threshold, the processing element continues processing instruction 512, at processing step 620. In processing step 620, the processing element executes instruction 512 and decrements dependency counter T1 ₀.

Meanwhile, the processing element processes thread 602 in processingstep 608. In processing step 608, the processing element receivesinstruction 510 from program memory. Instruction 510 includes dependsbit 628, which identifies instruction 510 as dependent on the executionof instructions in thread 604. The processing element checks thedependency counter group of thread 602, particularly the dependencycounter related to thread 604, in response to detecting depends bit 628.This corresponds to dependency counter T0 ₁. The value can be, forexample, zero, or some other number representing a predeterminedthreshold. For exemplary purposes, however, dependency counter T0 ₁ isdefined as having a value of the predetermined threshold. In any case,the value of dependency counter T0 ₁ indicates that instructions inthread 604 upon which instruction 510 depends, have not yet beenexecuted. In response to detecting that dependency counter T0 ₁ is notabove a threshold, the processing element suspends execution of thread602 in processing step 610.

Meanwhile, the processing element continues processing thread 604. Theprocessing element receives instruction 514 in processing step 622.Instruction 514 includes tells bit 630 that identifies thread 602 asincluding instructions dependent on instruction 514. The processingelement increments that dependency counter of the thread 602 dependencycounter group that is related to thread 604 (namely, dependency counterT0 ₁) in response to detecting tells bit 630, and executes instruction514, in processing step 622. Note that the order of executing theinstruction and incrementing or decrementing dependency counters ischosen for illustrative purposes only, and the same outcome can beachieved with reversed order.

After processing step 622, the processing element detects thatdependency counter T0 ₁ has been incremented to above the threshold, inprocessing step 612. As such, the processing element resumes processingthread 602 at instruction 510 in processing step 614. After resumingprocessing thread 602, the processing element executes instruction 510,decrements dependency counter T0 ₁, and continues processing theinstructions of thread 602, in step 616. Note that in the example ofFIG. 6, dependency counter T0 ₁ is now equal to the threshold value, andany additional instructions in thread 602 that include depends bitsidentifying thread 604 will cause the processing element to suspendexecution of the thread (absent prior instructions in thread 604 withtells bits identifying thread 602).
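
The sequence of FIG. 6 can be reproduced with a small software simulation. This is a sketch under stated assumptions: thread ids 0 and 1 stand in for threads 602 and 604, both counters start at a threshold of zero, and the threads are visited in simple round-robin order, which the embodiment does not require. The instruction labels `i508` through `i514` are shorthand for instructions 508 through 514.

```python
THRESHOLD = 0  # assumption: counters start at the threshold value

# (name, depends_on, tells) per instruction; thread ids 0 and 1 stand in
# for threads 602 and 604 of FIG. 6.
program = {
    0: [("i508", [], [1]), ("i510", [1], [])],
    1: [("i512", [0], []), ("i514", [], [0])],
}
counters = {0: {1: THRESHOLD}, 1: {0: THRESHOLD}}
pc = {0: 0, 1: 0}
trace = []

while any(pc[t] < len(program[t]) for t in program):
    progressed = False
    for t in program:  # simple round-robin over the threads
        if pc[t] >= len(program[t]):
            continue
        name, depends_on, tells = program[t][pc[t]]
        if any(counters[t][d] <= THRESHOLD for d in depends_on):
            trace.append(f"T{t} suspended at {name}")
            continue
        for d in depends_on:
            counters[t][d] -= 1     # consume the satisfied dependency
        for told in tells:
            counters[told][t] += 1  # signal the dependent thread
        trace.append(f"T{t} executes {name}")
        pc[t] += 1
        progressed = True
    if not progressed:
        raise RuntimeError("deadlock: no thread can make progress")

print("\n".join(trace))
```

With these assumptions the trace shows thread 0 suspending at instruction 510 until thread 1 executes instruction 514, after which both counters are again equal to the threshold, as noted at the end of the walkthrough above.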

The operation of thread synchronization diagram 600 is now describedwith reference to the elements of exemplary processing element 400. Theexecution of thread 602 begins in processing step 606. For descriptiveclarity, thread 602 is associated with instruction buffer 404A, andthread 604 is associated with instruction buffer 404B. In general,instruction fetch unit 402 fetches program instructions from programmemory 306. Instruction fetch unit 402 distributes the instructionsassociated with the four threads to one of instruction buffers 404A,404B, 404C, or 404D. In one embodiment, each of instruction buffers404A–404D is associated with a particular thread.

Instruction issue control 408 detects the presence of depends bits suchas depends bits 810–814 or the presence of tells bits, such as tellsbits 816–820 included in instructions in instruction buffers 404A–404D.Based on presence or absence of depends bits and tells bits in theinstruction, instruction issue control 408 controls function decode andexecution switch 406. Based on signals from instruction issue control408, function decode and execution switch 406 issues instructions frominstruction buffers 404A–404D to one of execution units 412–416 (i.e.,memory peripheral interface unit 412, primary function unit 414, orauxiliary function unit 416).

In processing step 606, instruction 508 is received in instruction buffer 404A. Instruction issue control 408 detects the presence of tells bit 624 in instruction 508. In response to detecting the presence of tells bit 624, instruction issue control 408 increments one of the dependency counters in dependency counters 410. As described above, instruction issue control 408 increments dependency counter T1 ₀. Instruction issue control 408 then causes function decode and execution switch 406 to provide instruction 508 to one of execution units 412–416 for execution. Meanwhile, processing element 400 is also processing thread 604. Instruction buffer 404B receives instruction 512, in processing step 618. Instruction issue control 408 detects the existence of depends bit 626 in instruction 512. Depends bit 626 identifies instruction 512 as dependent on instructions in thread 602. In response to detecting depends bit 626, instruction issue control 408 checks dependency counter T1 ₀ in processing step 618. Since dependency counter T1 ₀ is above the threshold (as described above), instruction issue control 408 enables function decode and execution switch 406 to provide instruction 512 to one of execution units 412–416 for execution. Additionally, instruction issue control 408 decrements dependency counter T1 ₀ in dependency counters 410.

Meanwhile, processing element 400 receives instruction 510 in processingstep 608. Instruction issue control 408 detects the existence of dependsbit 628 in instruction buffer 404A. Depends bit 628 identifiesinstruction 510 as dependent on instructions in thread 604. In responseto detecting depends bit 628, instruction issue control 408 checksdependency counter T0 ₁ in dependency counters 410. In this particularexample, dependency counter T0 ₁ is equal to the threshold necessary tocontinue processing instruction 510. Since dependency counter T0 ₁ isnot above the threshold, instruction issue control 408 suspendsexecution of thread 602 by holding instruction 510 in function decodeand execution switch 406.

Processing element 400 continues processing thread 604, and receivesinstruction 514 in processing step 622. Instruction 514 includes tellsbit 630 identifying thread 602 as dependent on the execution ofinstruction 514. Instruction issue control 408 increments dependencycounter T0 ₁ in response to detecting tells bit 630, in processing step622. Instruction issue control 408 causes function decode and executionswitch 406 to send instruction 514 to one of execution units 412–416 forexecution. After dependency counter T0 ₁ has been incremented inprocessing step 622, instruction issue control 408 detects thatdependency counter T0 ₁ has been incremented. Instruction issue control408 checks dependency counter T0 ₁ to determine if it is above thethreshold. In the example of thread synchronization diagram 600,instruction issue control 408 determines that dependency counter T0 ₁ isabove the threshold, in processing step 612. In response to detectingdependency counter T0 ₁ above the threshold, instruction issue control408 resumes processing thread 602 by issuing instruction 510 to one ofexecution units 412–416, in processing step 614. Instruction 510 isexecuted, and instruction issue control 408 decrements dependencycounter T0 ₁ in processing step 616.

FIG. 9 illustrates a process for executing instructions, according to an embodiment of the present invention. After method 900 starts in step 902, a processing element receives an instruction in a first thread, in step 904. In step 906, the processing element determines if the execution of the instruction in the first thread is dependent on the execution of instructions in a second thread.

If the processing element determines that the execution of the instruction in the first thread is not dependent on the execution of instructions in a second thread, method 900 ends in step 916.

If, on the other hand, the processing element determines that the execution of the instruction in the first thread is dependent on the execution of instructions in a second thread, the process of method 900 continues in step 908. In step 908, the processing element examines a dependency counter group that includes a dependency counter associated with the second thread.

In step 910, the processing element determines whether the dependency counter includes a value above a threshold. If the dependency counter includes a value above a threshold, method 900 continues in step 914. In step 914, the processing element executes the first thread instruction and decrements the dependency counter.

If, on the other hand, the processing element determines that the dependency counter does not include a value above a threshold, method 900 continues in step 912. In step 912, the processing element suspends execution of the first thread until the dependency counter is incremented to above a threshold. Once the dependency counter is incremented to above a threshold, processing of the first thread resumes, and method 900 continues in step 914. In step 914, the processing element executes the first thread instruction.
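
The flow of method 900 can be traced with the following sketch, which returns the step numbers visited for a single instruction. The function name `method_900` and its parameters are hypothetical, a threshold of zero is assumed, and the wait in step 912 is modeled by incrementing the counter in place as a stand-in for tells bits executed by the second thread. The step that follows step 914 is not modeled.

```python
def method_900(depends_on, counters, threshold=0):
    """Return the FIG. 9 step numbers visited for one first-thread instruction.

    `depends_on` is the second thread's id, or None when the instruction has
    no depends bit. `counters` maps a thread id to the counter value held for
    that thread in the first thread's dependency counter group.
    """
    steps = [902, 904, 906]
    if depends_on is None:
        return steps + [916]                 # no dependency: method ends
    steps += [908, 910]                      # examine the counter, compare
    if counters[depends_on] <= threshold:
        steps.append(912)                    # suspend the first thread
        while counters[depends_on] <= threshold:
            counters[depends_on] += 1        # stand-in for the second thread's tells bits
    counters[depends_on] -= 1                # step 914: execute and decrement
    steps.append(914)
    return steps


satisfied = {2: 1}
direct_path = method_900(2, satisfied)
unsatisfied = {2: 0}
suspended_path = method_900(2, unsatisfied)
```

The two calls differ only in whether step 912 appears in the returned path; in both cases the counter ends at the threshold value once the instruction has executed.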

FIG. 10 illustrates an exemplary process for executing instructions, according to an embodiment of the present invention. After method 1000 starts in step 1002, a processing element receives a first thread instruction, in step 1004. After the first thread instruction has been received, the processing element determines whether a second thread is dependent on the first thread instruction, in step 1006.

If a second thread is dependent on the execution of the first thread instruction, method 1000 continues in step 1008. In step 1008, the processing element increments a dependency counter included in a dependency counter group associated with the second thread. After the dependency counter is incremented, the processing element executes the first thread instruction, in step 1010.

If, on the other hand, the processing element determines that a second thread is not dependent on the first thread instruction, the process of method 1000 continues in step 1010. In step 1010, the processing element executes the first thread instruction.

After step 1010, method 1000 ends in step 1012.
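
Method 1000 can be sketched in the same style. The function name `method_1000`, its parameters, and the caller-supplied `execute` callable are hypothetical and serve only to make the step sequence concrete.

```python
def method_1000(first_thread, told_thread, groups, execute):
    """Follow the FIG. 10 steps for one first-thread instruction.

    `told_thread` is the dependent second thread's id, or None when the
    instruction carries no tells bit. `groups[t][u]` is the counter in
    thread t's group associated with thread u.
    """
    steps = [1002, 1004, 1006]
    if told_thread is not None:
        groups[told_thread][first_thread] += 1  # step 1008
        steps.append(1008)
    execute()                                   # step 1010
    return steps + [1010, 1012]


groups = {0: {1: 0}, 1: {0: 0}}
executed = []
path = method_1000(0, 1, groups, lambda: executed.append("i508"))
```

Here an instruction in thread 0 with a tells bit naming thread 1 passes through step 1008, leaving `groups[1][0]` incremented before the instruction executes.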

The present invention provides a system and method for high speed processing of network data elements. A network line module, such as network line module 104 ₁, receives network data elements from a network or switch fabric via a network line module ingress port. The network line module provides the network data elements to a multiprocessor core. The received network data elements are distributed to multiple processing elements within the multiprocessor core for processing according to a program.

The processing elements process the network data elements according to program instructions stored in program memory. Each of the processing elements uses instruction-level parallelism to process multiple threads of instructions concurrently. Instruction execution is synchronized by recording dependencies between instructions and threads. Instructions in the threads can include dependence indicators identifying dependencies between instructions and threads. When a processing element encounters an instruction that includes dependence indicators identifying a dependent instruction or thread, the processing element checks, decrements, or increments one or more dependency counters that record dependency between instructions and threads. If an instruction in a first thread is dependent upon the execution of instructions in a second thread, a dependency counter is checked. If the dependency counter is not above a predetermined threshold, the processing element suspends the execution of the first thread until the dependency counter is incremented by the second thread to above the threshold.

After processing, the multiprocessor core provides processed network data elements to the network line module. The network line module provides the processed network data element to an egress port connected to a network or switch fabric.

It will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope thereof. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

What is claimed is:
 1. An apparatus for instruction-level parallelism in a processing element, comprising: an instruction control unit; a first instruction buffer coupled to said instruction control unit, the first instruction buffer configured to hold a first instruction including a dependency indicator and being associated with a first thread; a second instruction buffer coupled to said instruction control unit, the second instruction buffer configured to hold a second instruction including a dependency indicator and being associated with a second thread; a dependency counter coupled to said instruction control unit; an execution switch coupled to said instruction control unit, said first instruction buffer, and said second instruction buffer; and an execution unit coupled to said execution switch; said instruction control unit configured to detect the dependency indicators and change the value of said dependency counter in response to detecting the dependency indicators and configured to disallow execution of the first instruction if said dependency counter includes a value less than a threshold value.
 2. The apparatus of claim 1, wherein said dependency counter includes a first counter associated with the first instruction buffer and a second counter associated with the second instruction buffer.
 3. The apparatus of claim 1, wherein said instruction control unit identifies instruction dependency bits in said first instruction buffer, the instruction dependency bits being associated with instructions.
 4. The apparatus of claim 1, said instruction control unit generating control signals based on the dependency bits and values included in said dependency counter.
 5. The apparatus of claim 4, said execution switch providing instructions from said first instruction buffer to said execution unit based on control signals from said instruction control unit.
 6. The apparatus of claim 1, said execution switch providing instructions from said first instruction buffer to said execution unit based on control signals from said instruction control unit.
 7. An apparatus for processing instructions in multiple threads in an execution unit, comprising: an instruction buffer configured to hold a first instruction and a second instruction, the first instruction being associated with a first thread, and the second instruction being associated with a second thread, the first instruction and the second instruction including one or more instruction dependency bits; a dependency counter; an instruction control unit coupled to said instruction buffer and said dependency counter, said instruction control unit configured to detect the instruction dependency bits and to increment and decrement said dependency counter in response to detecting the instruction dependence bits, said instruction control unit configured to disallow execution of the first instruction in response to said dependency counter including a value less than a threshold value; and an execution switch coupled to said instruction control unit and said instruction buffer, said execution switch configured to send instructions to the execution unit.
 8. The apparatus of claim 7, wherein said dependency counter includes a first counter associated with the first thread and a second counter associated with the second thread.
 9. The apparatus of claim 7, wherein said instruction buffer includes the instruction dependency bits, the instruction dependency bits being associated with instructions.
 10. The apparatus of claim 7, wherein said instruction control detects dependency between the first instruction and the second thread based on dependency bits in said instruction buffer and a value of said dependency counter.
 11. A method for processing instructions in multiple threads, comprising: receiving a first instruction associated with a first thread; determining whether execution of the first instruction depends on execution of a second instruction, the second instruction being associated with a second thread; examining a logic element associated with the first thread in response to said determining indicating that the first instruction depends on the execution of the second instruction, wherein the logic element comprises a single bi-state element or a tri-state element; modifying the logic element in response to said examining indicating that the second instruction has already been executed; executing the first instruction; and suspending the processing of the first thread until said examining indicates that the second instruction has already been executed and then resuming processing.
 12. The method of claim 11, further comprising suspending the processing of the first thread until said examining indicates that the second instruction has already been executed.
 13. A method for processing instructions in multiple threads, comprising: receiving a first instruction associated with a first thread; determining whether execution of a second instruction depends on the execution of the first instruction, the second instruction being associated with a second thread; incrementing a counter associated with the second thread in response to said determining indicating that execution of a second instruction depends on the execution of the first instruction; executing the first instruction; and suspending the processing of the second thread in response to the counter associated with the second thread not exceeding a threshold and resuming the processing of the second thread in response to the counter associated with the second thread exceeding the threshold; wherein the first instruction and the second instruction include one or more instruction dependency bits.
 14. The method of claim 13, further comprising suspending the processing of the second thread if the counter associated with the second thread does not exceed a threshold.
 15. A method for processing instructions in multiple threads, comprising: receiving a first instruction associated with a first thread, the first instruction including one or more instruction dependency bits; determining whether a second thread depends on said first instruction; incrementing a counter associated with the second thread in response to the second thread depending on said first instruction; loading a second instruction associated with a second thread; processing the second instruction in a manner related to the value of the counter associated with the second thread; and suspending the processing of the second thread in response to the counter not exceeding a threshold and resuming the processing of the second thread in response to the counter exceeding said threshold.
 16. The method of claim 15, further comprising suspending the processing of the second thread if the counter indicates that a dependent thread has not been executed.
 17. The method of claim 15, further comprising executing the second instruction if the counter indicates that said first instruction has been executed.
 18. An apparatus for processing instructions in multiple threads, comprising: an instruction buffer configured to hold a first instruction and a second instruction, the first instruction including a dependency indicator and being associated with a first thread, and the second instruction including a dependency indicator and being associated with a second thread; an instruction control unit coupled to said instruction buffer; a dependency counter coupled to said instruction control unit, said dependency counter associated with the first thread; said instruction control unit configured to detect the dependency indicators and increment and decrement said dependency counter in response to detecting the dependency indicators; and said instruction control unit configured to disallow execution of the first instruction in response to said dependency counter including a value less than a threshold value.
 19. The apparatus of claim 18, wherein said instruction control unit is configured to determine that the dependency indicator included in the first instruction indicates that the second thread includes an instruction on which the first instruction depends.
 20. The apparatus of claim 18, wherein the dependency indicator included in the first instruction is a depends bit.
 21. The apparatus of claim 18, wherein said instruction control unit is configured to determine that the dependency indicator included in the second instruction indicates that the first thread includes an instruction that is dependent on the second instruction.
 22. The apparatus of claim 18, wherein the dependency indicator included in the second instruction is a tells bit.
 23. The apparatus of claim 25, wherein said instruction control unit is configured to increment said dependency counter in response to detecting the dependency indicator included in the second instruction.
 24. The apparatus of claim 25, wherein said instruction control unit is configured to decrement said dependency counter in response to detecting the dependency indicator included in the first instruction.
 25. The apparatus according to claim 18, wherein said dependency counter is coupled to said instruction control unit and associated with the first thread.
 26. A method for processing instructions in multiple threads, comprising: receiving a first instruction associated with a first thread; determining that execution of the first instruction depends on execution of a second instruction, the second instruction being associated with a second thread; examining a dependency counter associated with the first thread to determine whether the second instruction has already been executed; incrementing the dependency counter in response to said determining indicating that execution of the first instruction depends on execution of the second instruction; and suspending the processing of the first thread when examining indicates that the dependency counter does not exceed a threshold and resuming the processing after the dependency counter exceeds said threshold, wherein the first instruction and the second instruction include one or more instruction dependency bits.